Getting Started with R and RStudio

Author

Martin Schweinberger

Introduction

This tutorial introduces R and RStudio — the programming language and development environment used throughout LADAL. It is aimed at complete beginners with no prior programming experience, and walks through everything you need to get up and running: installing software, understanding the RStudio interface, setting up a reproducible project, and working with R for the first time.

R is a free, open-source programming language designed specifically for data analysis and statistics. It is one of the most widely used tools for quantitative research in linguistics, the social sciences, and the digital humanities, and for good reason. R gives you complete control over your analysis, produces publication-quality graphics, and keeps your work fully transparent and reproducible.

This tutorial will not turn you into an expert. Its goal is to give you a solid, well-structured foundation: to know where things are, how to think about R, and how to start doing real things with data. The rest of LADAL’s tutorials build from here.

Prerequisite Tutorials

This tutorial has no prerequisites: it is designed for complete beginners.

What This Tutorial Covers

Installing R and RStudio — getting everything set up on your computer
The RStudio interface — understanding the four panes and how to navigate them
R Projects and R Notebooks — setting up reproducible, well-organised workflows
R fundamentals — objects, functions, operators, and data types
Data structures — vectors, data frames, lists, and factors
Indexing and subsetting — accessing and filtering data
Working with data — loading, inspecting, and manipulating tabular data
Basic visualisation — creating your first plots with ggplot2
Getting help — where to turn when things go wrong

Why R?

Before diving in, it is worth briefly explaining why R is worth learning.

R is free and open-source: there are no licensing costs, ever. It is one of the most widely used tools for statistical analysis in linguistics, psychology, and the social sciences. It has a vast ecosystem of over 20,000 contributed packages that extend its capabilities to cover almost any analytical task. Its reproducibility features (the ability to combine code, output, and prose in a single document) mean your analyses can be fully transparent and re-run by anyone. And its visualisation capabilities, particularly through ggplot2, are among its best-known strengths.

The learning curve is real but manageable. This tutorial gives you the foundation you need.

Preparation and Session Set-up

Install the packages used in this tutorial (only needed once):

Code

install.packages("dplyr")  
install.packages("ggplot2")  
install.packages("tidyr")  
install.packages("flextable")  
install.packages("readxl")  
install.packages("here")  
install.packages("checkdown")  
install.packages("readr")      # used later for readr::read_csv()
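Because installation is only needed once, a script can first check which packages are missing. The following is a minimal sketch rather than part of the original set-up; `needed`, `is_installed`, and `missing_pkgs` are illustrative names, and the install call is left commented out so that nothing is installed without your say-so.

```r
# Packages used in this tutorial
needed <- c("dplyr", "ggplot2", "tidyr", "flextable", "readxl", "here", "checkdown")

# requireNamespace() returns TRUE if a package is installed (it does not attach it)
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install the missing packages
} else {
  message("All packages are already installed.")
}
```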

Load the packages at the start of each session:

Code

library(dplyr)       # data manipulation  
library(ggplot2)     # data visualisation  
library(tidyr)       # data reshaping  
library(flextable)   # formatted tables  
library(here)        # robust file paths  
library(checkdown)   # interactive exercises

Installing R and RStudio

Section Overview

What you’ll learn: How to install R and RStudio on your computer

Why it matters: You need both installed to follow any LADAL tutorial

Time: ~15–30 minutes (mostly waiting for downloads)

R and RStudio are two separate pieces of software that work together. Think of R as the engine and RStudio as the car — you need both, and you interact almost exclusively with RStudio.

Installing R

R must be installed before RStudio. Visit cran.r-project.org and select the download for your operating system:

Windows: click Download R for Windows → base → Download R x.x.x for Windows
Mac: click Download R for macOS → select the version matching your macOS
Linux: follow the instructions for your distribution

Run the downloaded installer and accept the default settings throughout.

Keeping R Up to Date

R releases a new version approximately once a year. To check your current version, run R.version$version.string in the console. To update on Windows, the installr package automates the process:

Code

install.packages("installr")  
library(installr)  
updateR()

On Mac, download the new version from CRAN and install over the existing version.
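The version can also be checked from code. A small sketch using base R only; the 4.1.0 threshold is chosen here because the |> pipe used later in this tutorial needs at least that version, and `has_native_pipe` is an illustrative name.

```r
# Full version description as text
R.version$version.string

# getRversion() returns the version in a form that can be compared
current_version <- getRversion()
has_native_pipe <- current_version >= "4.1.0"
has_native_pipe
```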

Installing RStudio

Visit posit.co/download/rstudio-desktop and download the free RStudio Desktop version for your operating system. Run the installer and accept the defaults.

After installation, open RStudio (not R directly). RStudio will automatically detect your R installation.

The RStudio Interface

Section Overview

What you’ll learn: How to navigate the four panes of RStudio and what each one does

Key concept: The difference between the Console (run immediately) and the Script Editor (save and reuse)

When you first open RStudio, you will see an interface divided into panes. The screenshot below shows a typical RStudio session with all four panes visible.

RStudio has four main panes:

Pane 1: Script Editor (top left)

This is where you write and save code. Code typed here does not run automatically — you must explicitly execute it. This is where all your analysis lives.

To run a line of code from the Script Editor, place your cursor on that line and press Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac). To run a highlighted block, select the code first and then press the same shortcut.

Pane 2: Console (bottom left)

This is where R executes code and displays text output. When you run code from the Script Editor, it appears here. You can also type directly into the Console and press Enter to run commands immediately.

Use the Console for quick experiments. Use the Script Editor for anything you want to keep.

Console Shortcuts

Press the Up arrow in the Console to recall previous commands
Type the beginning of a command and press Tab to autocomplete
Type ?function_name to open the help page for any function

Pane 3: Environment and History (top right)

The Environment tab shows all objects currently loaded in your R session — data frames, variables, vectors, and so on. Clicking on a data frame here opens a spreadsheet-style viewer.

The History tab logs all commands you have run in the current session.

Pane 4: Files, Plots, Help, Packages (bottom right)

This multi-tab pane contains:

Files: Browse your project folder
Plots: View graphics output here
Help: Documentation for functions and packages (also accessible via ?)
Packages: See which packages are installed and loaded
Viewer: Preview rendered documents

Projects and Notebooks

Section Overview

What you’ll learn: How to set up a reproducible project in RStudio; what an R Notebook is and why to use one

Key concept: An R Project keeps all your files, code, and data together in one self-contained folder

Good organisation before you start coding saves a great deal of trouble later. This section walks through the recommended setup.

Step 1: Create a Project Folder

Before opening RStudio, create a folder on your computer for your project. Inside it, create the following sub-folders:

my_project/  
├── data/          ← raw and processed data files  
├── images/        ← figures saved from R  
├── tables/        ← tables exported from R  
└── docs/          ← notes, reports, and output documents
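The same folder structure can be created from R instead of by hand. This is a sketch under one assumption: it builds the project inside a temporary directory (`tempdir()`) so that it is safe to try; replace `project_dir` with a real location for actual work.

```r
# Hypothetical location: a temporary folder, so nothing on your disk is changed permanently
project_dir <- file.path(tempdir(), "my_project")
sub_folders <- c("data", "images", "tables", "docs")

# Create each sub-folder; recursive = TRUE also creates my_project/ itself
for (folder in sub_folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```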

Step 2: Create an R Project

An R Project tells RStudio that a folder is a self-contained project. It sets the working directory automatically (so file paths are predictable) and keeps your project’s history and settings separate from other projects.

To create an R Project:

Open RStudio
Click File → New Project
Select Existing Directory
Navigate to your project folder and click Create Project

RStudio will restart and you will see your project name in the top-right corner. You are now working inside your project.

Always Work Inside an R Project

When you open RStudio, always open your project first (either by double-clicking the .Rproj file in your folder, or via File → Open Project). This ensures file paths work correctly and your environment is isolated.

Step 3: Create an R Notebook

An R Notebook (.Rmd or .qmd file) combines prose, code, and output in a single document. This is the standard format for LADAL tutorials and is highly recommended for your own analyses — it keeps your thinking and your code together.

To create an R Notebook:

Click File → New File → R Notebook
Give it a meaningful title
Save it in your project folder

The notebook uses R Markdown — a simple formatting syntax explained below.

R Markdown Basics

R Markdown lets you write formatted prose alongside executable code. Here is a quick reference:

# Heading 1  
## Heading 2  
### Heading 3  
  
**bold text**  
*italic text*  
`inline code`  
  
- bullet point  
- another bullet  
  
1. numbered item  
2. another item  
  
[link text](https://url.com)

Code is written inside code chunks (fenced with triple backticks):

```{r}
# your R code here
2 + 2
```

When the chunk runs, its output appears directly below it:

```
[1] 4
```

When you click Knit (or Render in Quarto), R Markdown executes all code chunks and weaves the output together with your prose into a finished HTML, PDF, or Word document.

Reproducibility

The power of R Notebooks is reproducibility: your entire analysis (every number, table, and figure) is regenerated from scratch each time you render the document. Anyone with your .Rmd file, your data, and matching versions of R and its packages can reproduce your results.

R Fundamentals

Section Overview

What you’ll learn: The core building blocks of R — objects, functions, operators, and assignment

Key concepts: Everything in R is an object; everything you do in R uses a function

Setting Up a Session

At the top of any script or notebook, set global options and load packages. This makes your session reproducible from the very first line.

Code

# Global options  
options(stringsAsFactors = FALSE)   # keep character variables as text (already the default since R 4.0.0)  
options(scipen = 100)               # avoid scientific notation  
options(max.print = 100)            # limit printed output  
  
# Load packages  
library(dplyr)  
library(ggplot2)

Objects and Assignment

In R, everything is stored as an object. You create objects using the assignment operator <-:

Code

# Create a numeric object  
my_number <- 42  
  
# Create a character (text) object  
my_name <- "linguistics"  
  
# Create a logical object  
is_true <- TRUE  
  
# View an object by typing its name  
my_number

[1] 42

Code

my_name

[1] "linguistics"

Code

is_true

[1] TRUE

Naming Objects

Good object names are:
- lowercase with underscores for spaces: word_count, not Word Count
- descriptive: reaction_time_ms is better than x
- not starting with a number: data1 is valid; 1data is not
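- not reserved words or existing function names: TRUE, FALSE, and NULL are reserved and cannot be assigned to; c, t, df, and mean are built-in functions, and reusing their names makes code confusing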

R is case-sensitive: MyData and mydata are different objects.
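These rules can be checked in R itself. The sketch below relies on base R's make.names(), which rewrites a string into a syntactically valid name; if the string comes back unchanged, it was already valid. The helper `is_valid_name` is a hypothetical convenience function written for this illustration.

```r
# A name is syntactically valid if make.names() leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

candidates <- c("word_count", "Word Count", "data1", "1data", "TRUE", "mean")
validity <- setNames(is_valid_name(candidates), candidates)
validity
# "mean" is valid syntax, but it is still best avoided because it is a function name

# Case sensitivity: these are two different objects
mydata <- 1
MyData <- 2
identical(mydata, MyData)
```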

Functions

A function takes one or more inputs (called arguments), does something, and returns an output. Functions are called by name followed by parentheses containing the arguments:

Code

# sqrt() takes a number and returns its square root  
sqrt(144)

[1] 12

Code

# round() rounds a number to a specified number of decimal places  
round(3.14159, digits = 2)

[1] 3.14

Code

# nchar() counts the characters in a string  
nchar("linguistics")

[1] 11

Code

# paste() joins strings together  
paste("language", "data", "analysis", sep = "-")

[1] "language-data-analysis"

You can nest functions — the inner function runs first:

Code

# Round the square root of 2 to 3 decimal places  
round(sqrt(2), digits = 3)

[1] 1.414
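Arguments can be supplied by position or by name, and arguments with default values can be left out. A brief sketch using substr() and round() from base R; the object names are illustrative.

```r
# Matched by position: x, start, stop
by_position <- substr("linguistics", 1, 4)

# Matched by name: the order no longer matters
by_name <- substr(stop = 4, start = 1, x = "linguistics")

identical(by_position, by_name)

# round() has a default of digits = 0, so the second argument is optional
rounded_default <- round(3.14159)
rounded_default

# args() lists the arguments a function accepts
args(substr)
```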

Operators

R provides standard arithmetic and logical operators:

Code

# Arithmetic operators  
10 + 3    # addition

[1] 13

Code

10 - 3    # subtraction

[1] 7

Code

10 * 3    # multiplication

[1] 30

Code

10 / 3    # division

[1] 3.333333

Code

10 ^ 2    # exponentiation

[1] 100

Code

10 %% 3   # modulo (remainder)

[1] 1

Code

# Comparison operators (return TRUE or FALSE)  
5 > 3     # greater than

[1] TRUE

Code

5 < 3     # less than

[1] FALSE

Code

5 == 5    # equal to (note: double equals!)

[1] TRUE

Code

5 != 3    # not equal to

[1] TRUE

Code

5 >= 5    # greater than or equal to

[1] TRUE

Code

# Logical operators  
TRUE & FALSE   # AND

[1] FALSE

Code

TRUE | FALSE   # OR

[1] TRUE

Code

!TRUE          # NOT

[1] FALSE

= vs ==

One of the most common beginner errors: = is used for assignment (interchangeable with <- in most cases, though <- is preferred); == tests whether two things are equal. 5 = 3 will produce an error; 5 == 3 returns FALSE.
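A short sketch of the difference in practice; `threshold`, `is_five`, and `is_three` are illustrative names.

```r
# Assignment: both lines store the value 5 in threshold
threshold = 5
threshold <- 5    # preferred style

# Comparison: == asks a question and returns TRUE or FALSE
is_five  <- threshold == 5
is_three <- threshold == 3

is_five
is_three
```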

Exercises: R Fundamentals

Q1. What does the assignment operator <- do?

Q2. You run my_var <- 10. What will my_var * 3 + 1 return?

Q3. Which of the following is NOT a valid object name in R?

Data Types

Section Overview

What you’ll learn: The four data types you will use most often in R, and why they matter

Key concept: The type of your data determines which operations are valid

Every object in R has a type, which class() reports. The four types you will encounter most often are:

Code

# Numeric (continuous numbers)  
age <- 28.5  
class(age)

[1] "numeric"

Code

# Integer (whole numbers; the L suffix forces integer type)  
count <- 42L  
class(count)

[1] "integer"

Code

# Character (text; always in quotes)  
language <- "English"  
class(language)

[1] "character"

Code

# Logical (TRUE or FALSE only)  
is_native <- TRUE  
class(is_native)

[1] "logical"

You can check the type of any object with class() or typeof(), and test for specific types:

Code

is.numeric(age)

[1] TRUE

Code

is.character(language)

[1] TRUE

Code

is.logical(is_native)

[1] TRUE
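class() and typeof() do not always give the same answer: class() reports how R treats the object, while typeof() reports how it is stored internally. A minimal sketch with illustrative object names:

```r
decimal_value <- 28.5
whole_value   <- 42L

class(decimal_value)    # "numeric"
typeof(decimal_value)   # "double": numeric values are stored as doubles

class(whole_value)      # "integer"
typeof(whole_value)     # "integer"

# Integers also count as numeric
is.numeric(whole_value)
```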

You can convert between types using coercion functions:

Code

# Character to numeric  
as.numeric("3.14")

[1] 3.14

Code

# Numeric to character  
as.character(42)

[1] "42"

Code

# Numeric to logical (0 = FALSE, everything else = TRUE)  
as.logical(0)

[1] FALSE

Code

as.logical(1)

[1] TRUE

Code

as.logical(-99)

[1] TRUE

Coercion Failures

When R cannot coerce a value, it introduces NA (missing value) with a warning:

Code

as.numeric("hello")  # "hello" cannot be a number → NA

Warning: NAs introduced by coercion

[1] NA

NA stands for Not Available and represents missing data. It propagates through calculations — any arithmetic involving NA returns NA unless specifically handled.
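The following sketch shows NA propagating through a calculation and one common way to handle it, the na.rm argument. The reaction times are invented toy values.

```r
# Three reaction times, one of them missing
rts <- c(480, NA, 610)

rts + 10                          # 490 NA 620: the missing value stays missing
mean(rts)                         # NA: one missing value makes the mean unknown

mean_rt <- mean(rts, na.rm = TRUE)  # drop NA before averaging
mean_rt                             # 545

is.na(rts)                        # FALSE TRUE FALSE
```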

Data Structures

Section Overview

What you’ll learn: How R organises collections of data — vectors, data frames, lists, and factors

Key concept: Vectors are the fundamental unit; data frames are collections of equal-length vectors

Vectors

A vector is a sequence of values of the same type. Vectors are created with c() (short for combine):

Code

# Numeric vector  
word_lengths <- c(3, 5, 2, 8, 4, 6, 1)  
  
# Character vector  
languages <- c("English", "German", "Mandarin", "Arabic")  
  
# Logical vector  
is_content_word <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

You can perform operations on entire vectors at once — R applies them element-by-element:

Code

# Arithmetic on a vector  
word_lengths * 2

[1]  6 10  4 16  8 12  2

Code

# Logical comparison on a vector  
word_lengths > 4

[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Code

# Common summary functions  
length(word_lengths)     # number of elements

[1] 7

Code

sum(word_lengths)        # sum

[1] 29

Code

mean(word_lengths)       # mean

[1] 4.142857

Code

sd(word_lengths)         # standard deviation

[1] 2.410295

Code

min(word_lengths)        # minimum

[1] 1

Code

max(word_lengths)        # maximum

[1] 8

Code

range(word_lengths)      # min and max together

[1] 1 8
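Element-by-element operations also work between two vectors of the same length. As an illustration with invented counts, a type-token ratio can be computed for several texts at once:

```r
# Invented counts for three short texts
tokens <- c(120, 300, 80)   # running words per text
types  <- c(60, 90, 40)     # distinct words per text

# Division is applied pairwise: 60/120, 90/300, 40/80
ttr <- types / tokens
ttr                         # 0.5 0.3 0.5
```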

Sequences and Repetitions

Code

# Create a sequence with :  
1:10

 [1]  1  2  3  4  5  6  7  8  9 10

Code

# Create a sequence with seq()  
seq(from = 0, to = 1, by = 0.25)

[1] 0.00 0.25 0.50 0.75 1.00

Code

seq(from = 1, to = 100, length.out = 5)

[1]   1.00  25.75  50.50  75.25 100.00

Code

# Repeat values with rep()  
rep("yes", times = 3)

[1] "yes" "yes" "yes"

Code

rep(c("A", "B"), times = 4)

[1] "A" "B" "A" "B" "A" "B" "A" "B"

Code

rep(c("A", "B"), each = 4)

[1] "A" "A" "A" "A" "B" "B" "B" "B"

Factors

A factor is a special type of vector for categorical variables. Factors have a fixed set of levels (categories) and are essential for grouping in analyses and plots.

Code

# Create a factor  
register <- factor(c("Formal", "Informal", "Formal", "ReadAloud", "Informal"))  
  
# Inspect the factor  
register

[1] Formal    Informal  Formal    ReadAloud Informal 
Levels: Formal Informal ReadAloud

Code

levels(register)    # the unique categories

[1] "Formal"    "Informal"  "ReadAloud"

Code

nlevels(register)   # how many categories

[1] 3

Code

table(register)     # frequency of each level

register
   Formal  Informal ReadAloud 
        2         2         1

By default, levels are ordered alphabetically. You can specify a custom order:

Code

# Custom level order (important for plots and models)  
register_ordered <- factor(  
  c("Formal", "Informal", "Formal", "ReadAloud", "Informal"),  
  levels = c("Formal", "ReadAloud", "Informal")  
)  
  
levels(register_ordered)

[1] "Formal"    "ReadAloud" "Informal"
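Level order affects how results are arranged, for example in frequency tables. A small sketch comparing the default and a custom order; the object names are illustrative.

```r
responses <- c("Formal", "Informal", "Formal", "ReadAloud", "Informal")

default_order <- factor(responses)
custom_order  <- factor(responses, levels = c("Formal", "ReadAloud", "Informal"))

# table() lists categories in level order, not in order of appearance
names(table(default_order))   # "Formal" "Informal" "ReadAloud"
names(table(custom_order))    # "Formal" "ReadAloud" "Informal"

# Internally, each value is stored as the position of its level
level_codes <- as.integer(custom_order)
level_codes                   # 1 3 1 2 3
```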

Data Frames

A data frame is R’s equivalent of a spreadsheet — a table where each column is a vector of the same length. Data frames are the most common way to store linguistic data.

Code

# Create a data frame from scratch  
speakers <- data.frame(  
  ID          = 1:6,  
  Name        = c("Alice", "Bob", "Carol", "David", "Eve", "Frank"),  
  L1          = c("English", "German", "English", "Mandarin", "English", "Arabic"),  
  Age         = c(24, 31, 28, 22, 35, 27),  
  Proficiency = factor(c("Advanced", "Intermediate", "Advanced",  
                          "Beginner", "Intermediate", "Advanced"),  
                       levels = c("Beginner", "Intermediate", "Advanced"))  
)  
  
# Inspect the data frame  
speakers

  ID  Name       L1 Age  Proficiency
1  1 Alice  English  24     Advanced
2  2   Bob   German  31 Intermediate
3  3 Carol  English  28     Advanced
4  4 David Mandarin  22     Beginner
5  5   Eve  English  35 Intermediate
6  6 Frank   Arabic  27     Advanced

Key functions for inspecting a data frame:

Code

nrow(speakers)         # number of rows (observations)

[1] 6

Code

ncol(speakers)         # number of columns (variables)

[1] 5

Code

dim(speakers)          # both at once

[1] 6 5

Code

names(speakers)        # column names

[1] "ID"          "Name"        "L1"          "Age"         "Proficiency"

Code

str(speakers)          # structure: types and first values

'data.frame':   6 obs. of  5 variables:
 $ ID         : int  1 2 3 4 5 6
 $ Name       : chr  "Alice" "Bob" "Carol" "David" ...
 $ L1         : chr  "English" "German" "English" "Mandarin" ...
 $ Age        : num  24 31 28 22 35 27
 $ Proficiency: Factor w/ 3 levels "Beginner","Intermediate",..: 3 2 3 1 2 3

Code

head(speakers, n = 3)  # first 3 rows

  ID  Name      L1 Age  Proficiency
1  1 Alice English  24     Advanced
2  2   Bob  German  31 Intermediate
3  3 Carol English  28     Advanced

Code

tail(speakers, n = 2)  # last 2 rows

  ID  Name      L1 Age  Proficiency
5  5   Eve English  35 Intermediate
6  6 Frank  Arabic  27     Advanced

Code

summary(speakers)      # summary statistics per column

       ID           Name                L1                 Age       
 Min.   :1.00   Length:6           Length:6           Min.   :22.00  
 1st Qu.:2.25   Class :character   Class :character   1st Qu.:24.75  
 Median :3.50   Mode  :character   Mode  :character   Median :27.50  
 Mean   :3.50                                         Mean   :27.83  
 3rd Qu.:4.75                                         3rd Qu.:30.25  
 Max.   :6.00                                         Max.   :35.00  
       Proficiency
 Beginner    :1   
 Intermediate:2   
 Advanced    :3

Lists

A list is the most flexible data structure — it can hold objects of different types and lengths, including other lists.

Code

# Create a list with mixed types  
my_list <- list(  
  name     = "Study 1",  
  n        = 30,  
  groups   = c("Control", "Treatment"),  
  complete = TRUE  
)  
  
# Access list elements with $ or [[]]  
my_list$name

[1] "Study 1"

Code

my_list[["n"]]

[1] 30

Lists are commonly returned by statistical model functions (e.g., lm() returns a list). You rarely create them from scratch but frequently need to extract elements from them.
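As a sketch of extracting elements from a model object, the example below fits a simple regression to `cars`, a small dataset that ships with R and is used here only for illustration.

```r
# Fit a linear model to the built-in cars data (stopping distance by speed)
fit <- lm(dist ~ speed, data = cars)

# The fitted model is a list with named elements
is.list(fit)
names(fit)

# Extract elements exactly as with any other list
coefs_dollar  <- fit$coefficients
coefs_bracket <- fit[["coefficients"]]
coefs_dollar

# One residual per observation
length(fit$residuals)
```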

Exercises: Data Structures

Q1. You run x <- c(1, 2, "three", 4). What type will x be?

Q2. What is the difference between a factor and a character vector?

Q3. What does dim(df) return for a data frame with 50 rows and 4 columns?

Indexing and Subsetting

Section Overview

What you’ll learn: How to access specific elements, rows, columns, and subsets of your data

Key concept: Square brackets [ ] select by position, name, or logical condition; $ selects a single column by name; dplyr verbs filter and select with more readable syntax

Extracting exactly the data you need is one of the most fundamental R skills.

Indexing Vectors

Use square brackets [ ] with a position number (index) to extract elements from a vector. R indexing starts at 1 (not 0 as in Python).

Code

languages <- c("English", "German", "Mandarin", "Arabic", "French")  
  
# Extract a single element  
languages[1]       # first element

[1] "English"

Code

languages[4]       # fourth element

[1] "Arabic"

Code

# Extract multiple elements  
languages[c(1, 3)] # first and third

[1] "English"  "Mandarin"

Code

languages[2:4]     # second through fourth

[1] "German"   "Mandarin" "Arabic"

Code

# Exclude elements (negative indexing)  
languages[-2]      # everything except the second element

[1] "English"  "Mandarin" "Arabic"   "French"

Code

languages[-c(1,5)] # everything except first and fifth

[1] "German"   "Mandarin" "Arabic"

Code

# Logical indexing  
word_lengths <- c(3, 5, 2, 8, 4, 6, 1)  
word_lengths[word_lengths > 4]          # elements greater than 4

[1] 5 8 6

Code

word_lengths[word_lengths == min(word_lengths)]  # the minimum value

[1] 1

Indexing Data Frames

Data frames have two dimensions: df[row, column]. Leave one blank to select all rows or all columns.

Code

# Using the speakers data frame from earlier  
  
# Single cell: row 2, column 3  
speakers[2, 3]

[1] "German"

Code

# Entire row 1  
speakers[1, ]

  ID  Name      L1 Age Proficiency
1  1 Alice English  24    Advanced

Code

# Entire column 3 (returns a vector)  
speakers[, 3]

[1] "English"  "German"   "English"  "Mandarin" "English"  "Arabic"

Code

# Column by name using $  
speakers$Age

[1] 24 31 28 22 35 27

Code

speakers$L1

[1] "English"  "German"   "English"  "Mandarin" "English"  "Arabic"

Code

# Multiple rows and columns  
speakers[1:3, c("Name", "Age")]

   Name Age
1 Alice  24
2   Bob  31
3 Carol  28
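Rows can also be selected with a logical condition, just as with vectors. The sketch below uses a small stand-in data frame, `speakers_small` (a hypothetical name, so that it does not overwrite `speakers`), and also shows the `drop = FALSE` argument, which keeps a single selected column as a data frame.

```r
speakers_small <- data.frame(
  Name = c("Alice", "Bob", "Carol", "David"),
  L1   = c("English", "German", "English", "Mandarin"),
  Age  = c(24, 31, 28, 22)
)

# Logical condition in the row position: keep speakers younger than 30
under_30 <- speakers_small[speakers_small$Age < 30, ]
under_30

# A single column is returned as a vector by default ...
ages_vec <- speakers_small[, "Age"]

# ... unless drop = FALSE asks R to keep the data frame structure
ages_df <- speakers_small[, "Age", drop = FALSE]
```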

Subsetting with `dplyr`

While base R indexing works, the dplyr package provides cleaner, more readable syntax for filtering and selecting data. These are the two most important dplyr verbs for subsetting:

Code

# filter() keeps rows that meet a condition  
speakers |>  
  dplyr::filter(L1 == "English")

  ID  Name      L1 Age  Proficiency
1  1 Alice English  24     Advanced
2  3 Carol English  28     Advanced
3  5   Eve English  35 Intermediate

Code

# select() keeps specified columns  
speakers |>  
  dplyr::select(Name, Age, Proficiency)

   Name Age  Proficiency
1 Alice  24     Advanced
2   Bob  31 Intermediate
3 Carol  28     Advanced
4 David  22     Beginner
5   Eve  35 Intermediate
6 Frank  27     Advanced

Code

# Combine both  
speakers |>  
  dplyr::filter(Age < 30) |>  
  dplyr::select(Name, L1, Age)

   Name       L1 Age
1 Alice  English  24
2 Carol  English  28
3 David Mandarin  22
4 Frank   Arabic  27

The Pipe Operator |>

The pipe |> is built into base R (version 4.1 and later). It passes the result on the left to the function on the right as its first argument. It lets you chain operations in a readable left-to-right sequence instead of nesting functions:

# Without pipe (hard to read)  
select(filter(speakers, Age < 30), Name, Age)  
  
# With pipe (reads like a sentence)  
speakers |> filter(Age < 30) |> select(Name, Age)

The magrittr package, whose pipe dplyr re-exports, provides an older pipe, %>%, that works similarly. LADAL tutorials use |>.
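The same left-to-right chaining works with any functions, not only dplyr verbs. A minimal base R sketch (requires R 4.1 or later; the object names are illustrative):

```r
values <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(sum(sqrt(values)), 1)

# Piped: read from left to right
piped <- values |> sqrt() |> sum() |> round(1)

identical(nested, piped)
piped   # 6
```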

Exercises: Indexing

Q1. Given v <- c(10, 20, 30, 40, 50), what does v[c(2, 4)] return?

Q2. How do you use dplyr::filter() to keep only rows where the column Proficiency equals "Advanced"?

Working with Data

Section Overview

What you’ll learn: How to load data from files, inspect it, and perform common data manipulation operations

Key functions: read.csv(), readxl::read_excel(), dplyr::mutate(), dplyr::group_by(), dplyr::summarise()

Loading Data

From CSV

Code

# Base R  
my_data <- read.csv("data/my_file.csv")  
  
# Using here() for robust paths (recommended)  
my_data <- read.csv(here::here("data", "my_file.csv"))  
  
# Tidyverse readr (requires the readr package; faster on large files and returns a tibble)  
my_data <- readr::read_csv(here::here("data", "my_file.csv"))

From Excel

Code

library(readxl)  
my_data <- readxl::read_excel(here::here("data", "my_file.xlsx"))  
  
# Specify a sheet  
my_data <- readxl::read_excel(here::here("data", "my_file.xlsx"), sheet = "Sheet2")

Saving Data

Code

# Save as CSV  
write.csv(my_data, here::here("data", "processed_data.csv"), row.names = FALSE)  
  
# Save as R object (preserves factors and other R-specific attributes)  
saveRDS(my_data, here::here("data", "processed_data.rds"))  
  
# Load an RDS file  
my_data <- readRDS(here::here("data", "processed_data.rds"))
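To see what "preserves factors" means in practice, the sketch below saves a tiny invented data frame in both formats and reads it back. It writes to temporary files via tempfile() rather than to the project folder, and it assumes R 4.0 or later, where read.csv() reads text columns as character.

```r
ratings <- data.frame(
  Item   = c("a", "b", "c"),
  Rating = factor(c("low", "high", "low"), levels = c("low", "high"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(ratings, csv_path, row.names = FALSE)
saveRDS(ratings, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

class(from_csv$Rating)    # plain text: the factor and its level order are gone
class(from_rds$Rating)    # still a factor
levels(from_rds$Rating)   # "low" "high": custom order preserved
```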

Manipulating Data with dplyr

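We will use a simulated linguistic dataset to demonstrate the key dplyr operations. The dataset contains reaction times and accuracy from a lexical decision task. Note that Condition cycles row by row while RT_ms is generated in three consecutive blocks of 20 values, so the three distributions are spread across all conditions rather than tied to one each. This is why the condition means in the summaries that follow are so similar:

Code

set.seed(42)  
n <- 60   # total observations: 20 participants x 3 conditions  
  
lex_data <- data.frame(  
  Participant    = rep(1:20, each = 3),  
  Condition      = rep(c("High_Freq", "Low_Freq", "Pseudoword"), times = 20),  
  RT_ms          = c(  
    rnorm(20, mean = 480, sd = 55),   # rows 1-20: fast responses  
    rnorm(20, mean = 610, sd = 70),   # rows 21-40: slower responses  
    rnorm(20, mean = 730, sd = 80)    # rows 41-60: slowest responses  
  ),  
  Accurate       = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.9, 0.1))  
) |>  
  dplyr::mutate(Condition = factor(Condition,  
                                   levels = c("High_Freq", "Low_Freq", "Pseudoword")))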
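Because Condition changes on every row, condition-specific reaction times need their means and standard deviations supplied in that same row order. The sketch below is one way to do this; `lex_aligned`, `cond_means`, and `cond_sds` are hypothetical names, and its random values will differ from the output shown in this tutorial, which is based on `lex_data`.

```r
set.seed(42)

conditions <- c("High_Freq", "Low_Freq", "Pseudoword")
cond_means <- c(High_Freq = 480, Low_Freq = 610, Pseudoword = 730)
cond_sds   <- c(High_Freq = 55,  Low_Freq = 70,  Pseudoword = 80)

lex_aligned <- data.frame(
  Participant = rep(1:20, each = 3),
  Condition   = factor(rep(conditions, times = 20), levels = conditions)
)

# Look up the mean and SD that belong to each row's condition
cond_labels <- as.character(lex_aligned$Condition)
lex_aligned$RT_ms <- rnorm(
  n    = nrow(lex_aligned),
  mean = cond_means[cond_labels],
  sd   = cond_sds[cond_labels]
)

# Mean RT per condition
aggregate(RT_ms ~ Condition, data = lex_aligned, FUN = mean)
```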

`mutate()` — Add or Modify Columns

Code

# Add new columns: RT in seconds, log-transformed RT, and a flag for responses under 500 ms  
lex_data <- lex_data |>  
  dplyr::mutate(  
    RT_s         = RT_ms / 1000,  
    RT_log       = log(RT_ms),  
    Fast_respons = RT_ms < 500  
  )  
  
head(lex_data)

  Participant  Condition    RT_ms Accurate      RT_s   RT_log Fast_respons
1           1  High_Freq 555.4027     TRUE 0.5554027 6.319693        FALSE
2           1   Low_Freq 448.9416     TRUE 0.4489416 6.106893         TRUE
3           1 Pseudoword 499.9721     TRUE 0.4999721 6.214552         TRUE
4           2  High_Freq 514.8074     TRUE 0.5148074 6.243793        FALSE
5           2   Low_Freq 502.2348     TRUE 0.5022348 6.219068        FALSE
6           2 Pseudoword 474.1632     TRUE 0.4741632 6.161551         TRUE

`group_by()` and `summarise()` — Aggregate by Group

Code

lex_data |>  
  dplyr::group_by(Condition) |>  
  dplyr::summarise(  
    n          = n(),  
    M_RT       = round(mean(RT_ms), 1),  
    SD_RT      = round(sd(RT_ms), 1),  
    Accuracy   = round(mean(Accurate) * 100, 1),  
    .groups    = "drop"  
  ) |>  
  flextable() |>  
  flextable::set_table_properties(width = .8, layout = "autofit") |>  
  flextable::theme_zebra() |>  
  flextable::fontsize(size = 12) |>  
  flextable::fontsize(size = 12, part = "header") |>  
  flextable::align_text_col(align = "center") |>  
  flextable::set_caption(caption = "Reaction times and accuracy by condition in the lexical decision task.") |>  
  flextable::border_outer()

Condition     n   M_RT  SD_RT  Accuracy
High_Freq    20  592.9  125.9        90
Low_Freq     20  605.0  117.9        80
Pseudoword   20  613.7  135.6       100

`arrange()` — Sort Rows

Code

# Sort by RT (ascending)  
lex_data |>  
  dplyr::arrange(RT_ms) |>  
  head(5)

  Participant  Condition    RT_ms Accurate      RT_s   RT_log Fast_respons
1           6 Pseudoword 333.8950     TRUE 0.3338950 5.810826         TRUE
2           7  High_Freq 345.7743     TRUE 0.3457743 5.845786         TRUE
3           5  High_Freq 403.6127     TRUE 0.4036127 6.000456         TRUE
4          13 Pseudoword 441.0055     TRUE 0.4410055 6.089057         TRUE
5           1   Low_Freq 448.9416     TRUE 0.4489416 6.106893         TRUE

Code

# Sort descending  
lex_data |>  
  dplyr::arrange(desc(RT_ms)) |>  
  head(5)

  Participant  Condition    RT_ms Accurate      RT_s   RT_log Fast_respons
1          18   Low_Freq 856.0582     TRUE 0.8560582 6.752338        FALSE
2          16 Pseudoword 845.5281     TRUE 0.8455281 6.739961        FALSE
3          15  High_Freq 790.6531     TRUE 0.7906531 6.672859        FALSE
4          19 Pseudoword 784.3431     TRUE 0.7843431 6.664847        FALSE
5          17   Low_Freq 782.4518     TRUE 0.7824518 6.662432        FALSE

`rename()` and `relocate()`

Code

# Rename columns  
lex_data |>  
  dplyr::rename(ReactionTime = RT_ms, Correct = Accurate) |>  
  head(3)

  Participant  Condition ReactionTime Correct      RT_s   RT_log Fast_respons
1           1  High_Freq     555.4027    TRUE 0.5554027 6.319693        FALSE
2           1   Low_Freq     448.9416    TRUE 0.4489416 6.106893         TRUE
3           1 Pseudoword     499.9721    TRUE 0.4999721 6.214552         TRUE
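relocate() changes the order of columns without renaming them. A minimal sketch on a small invented data frame (`trials` is a hypothetical name), using the `.after` argument of dplyr::relocate():

```r
trials <- data.frame(
  Participant = 1:3,
  Condition   = c("High_Freq", "Low_Freq", "Pseudoword"),
  RT_ms       = c(512, 634, 701)
)

# Move RT_ms so that it comes directly after Participant
moved <- trials |>
  dplyr::relocate(RT_ms, .after = Participant)

names(moved)

# With no position given, the selected column moves to the front
front <- trials |>
  dplyr::relocate(RT_ms)

names(front)
```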

`count()` — Quick Frequency Tables

Code

# How many observations per condition?  
lex_data |>  
  dplyr::count(Condition)

   Condition  n
1  High_Freq 20
2   Low_Freq 20
3 Pseudoword 20

Code

# Cross-tabulate condition and accuracy  
lex_data |>  
  dplyr::count(Condition, Accurate)

   Condition Accurate  n
1  High_Freq    FALSE  2
2  High_Freq     TRUE 18
3   Low_Freq    FALSE  4
4   Low_Freq     TRUE 16
5 Pseudoword     TRUE 20

Handling Missing Values

Code

# Check for missing values  
sum(is.na(lex_data$RT_ms))

[1] 0

Code

colSums(is.na(lex_data))

 Participant    Condition        RT_ms     Accurate         RT_s       RT_log 
           0            0            0            0            0            0 
Fast_respons 
           0

Code

# Remove rows with any missing value  
lex_data_clean <- lex_data |>  
  tidyr::drop_na()  
  
# Replace NA with a value (e.g., mean imputation — use cautiously!)  
lex_data |>  
  dplyr::mutate(RT_ms = ifelse(is.na(RT_ms), mean(RT_ms, na.rm = TRUE), RT_ms))

   Participant  Condition    RT_ms Accurate      RT_s   RT_log Fast_respons
1            1  High_Freq 555.4027     TRUE 0.5554027 6.319693        FALSE
2            1   Low_Freq 448.9416     TRUE 0.4489416 6.106893         TRUE
3            1 Pseudoword 499.9721     TRUE 0.4999721 6.214552         TRUE
4            2  High_Freq 514.8074     TRUE 0.5148074 6.243793        FALSE
5            2   Low_Freq 502.2348     TRUE 0.5022348 6.219068        FALSE
6            2 Pseudoword 474.1632     TRUE 0.4741632 6.161551         TRUE
7            3  High_Freq 563.1337    FALSE 0.5631337 6.333517        FALSE
8            3   Low_Freq 474.7938    FALSE 0.4747938 6.162881         TRUE
9            3 Pseudoword 591.0133     TRUE 0.5910133 6.381839        FALSE
10           4  High_Freq 476.5507     TRUE 0.4765507 6.166574         TRUE
11           4   Low_Freq 551.7678    FALSE 0.5517678 6.313127        FALSE
12           4 Pseudoword 605.7655     TRUE 0.6057655 6.406493        FALSE
13           5  High_Freq 403.6127     TRUE 0.4036127 6.000456         TRUE
14           5   Low_Freq 464.6666    FALSE 0.4646666 6.141320         TRUE
 [ reached 'max' / getOption("max.print") -- omitted 46 rows ]
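Since `lex_data` contains no missing values, the effect of these steps is easier to see on data that does. The sketch below uses an invented data frame, `rt_data`, and base R's complete.cases() as a counterpart to tidyr::drop_na().

```r
rt_data <- data.frame(
  Participant = 1:5,
  RT_ms       = c(520, NA, 610, 480, NA)
)

# Count missing values per column
na_counts <- colSums(is.na(rt_data))
na_counts

# Keep only rows without any missing value
complete_rows <- rt_data[complete.cases(rt_data), ]
nrow(complete_rows)

# Mean imputation (use cautiously): replace NA with the mean of the observed values
imputed <- rt_data
imputed$RT_ms <- ifelse(is.na(rt_data$RT_ms),
                        mean(rt_data$RT_ms, na.rm = TRUE),
                        rt_data$RT_ms)
imputed
```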

Exercises: Working with Data

Q1. What does dplyr::mutate() do?

Q2. You want the mean RT for each participant across all conditions. Which dplyr pipeline is correct?

Basic Visualisation with ggplot2

Section Overview

What you’ll learn: How to create basic plots using ggplot2; the layered grammar of graphics

Key concept: Every ggplot2 plot is built by adding layers — data, aesthetics, geometries, and themes

ggplot2 is the most widely used plotting package in R. It is based on the Grammar of Graphics: the idea that every plot can be described by a consistent set of components.

The Grammar of Graphics

Every ggplot2 plot has at least three components:

Data: the data frame containing your variables
Aesthetics (aes()): which variables map to which visual properties (x axis, y axis, colour, size, shape)
Geometry (geom_*()): how the data are visually represented (points, bars, lines, boxes)

Additional optional components include scales, facets, themes, and labels.

ggplot(data = my_data, aes(x = variable1, y = variable2)) +  
  geom_point() +  
  theme_bw() +  
  labs(title = "My plot", x = "X label", y = "Y label")
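
One way to see the layering is to build a plot step by step and store each stage in its own object. The sketch below is only an illustration: it assumes ggplot2 is installed and uses the built-in mtcars data set rather than lex_data, and the object names (p_base, p_points, p_full) are made up for this example.

```r
library(ggplot2)

# Stage 1: data and aesthetic mappings only (draws empty axes, no data yet)
p_base <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Stage 2: add a geometry so the observations become visible
p_points <- p_base + geom_point()

# Stage 3: add a theme and labels on top
p_full <- p_points +
  theme_bw() +
  labs(title = "Fuel economy by weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

# Printing a plot object draws it
print(p_full)
```

Each + adds one component to the existing plot object, so you can stop at any stage and inspect the intermediate result.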

Histograms

Code

ggplot(lex_data, aes(x = RT_ms, fill = Condition)) +  
  geom_histogram(bins = 20, color = "white", alpha = 0.7) +  
  facet_wrap(~ Condition, ncol = 1) +  
  scale_fill_manual(values = c("steelblue", "tomato", "seagreen")) +  
  theme_bw() +  
  theme(legend.position = "none", panel.grid.minor = element_blank()) +  
  labs(title = "Distribution of reaction times by condition",  
       x = "Reaction time (ms)", y = "Count")

Boxplots

Code

ggplot(lex_data, aes(x = Condition, y = RT_ms, fill = Condition)) +  
  geom_boxplot(alpha = 0.7, outlier.color = "gray40") +  
  stat_summary(fun = mean, geom = "point",  
               shape = 18, size = 3, color = "black") +  
  scale_fill_manual(values = c("steelblue", "tomato", "seagreen")) +  
  theme_bw() +  
  theme(legend.position = "none", panel.grid.minor = element_blank()) +  
  labs(title = "Reaction times by condition",  
       subtitle = "Diamond = group mean; box = median and IQR",  
       x = "Condition", y = "Reaction time (ms)")

Bar Charts

Code

lex_data |>  
  dplyr::group_by(Condition) |>  
  dplyr::summarise(M_RT = mean(RT_ms),  
                   SE   = sd(RT_ms) / sqrt(dplyr::n()),  
                   .groups = "drop") |>  
  ggplot(aes(x = Condition, y = M_RT, fill = Condition)) +  
  geom_col(alpha = 0.8, width = 0.6) +  
  geom_errorbar(aes(ymin = M_RT - SE, ymax = M_RT + SE),  
                width = 0.2, linewidth = 0.8) +  
  scale_fill_manual(values = c("steelblue", "tomato", "seagreen")) +  
  theme_bw() +  
  theme(legend.position = "none", panel.grid.minor = element_blank()) +  
  labs(title = "Mean reaction time by condition",  
       subtitle = "Error bars = ±1 SE",  
       x = "Condition", y = "Mean RT (ms)")
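
The error bars in this chart show the standard error of the mean, computed as the standard deviation divided by the square root of the number of observations. A minimal worked example in base R, using five made-up reaction times (the vector rt and the other object names are hypothetical):

```r
# Hypothetical reaction times (ms) for one condition
rt <- c(480, 520, 500, 540, 460)

n     <- length(rt)        # number of observations
m_rt  <- mean(rt)          # group mean
sd_rt <- sd(rt)            # sample standard deviation
se_rt <- sd_rt / sqrt(n)   # standard error of the mean

# Lower and upper ends of the error bar (mean minus / plus 1 SE)
bar_limits <- c(lower = m_rt - se_rt, upper = m_rt + se_rt)

print(round(c(mean = m_rt, sd = sd_rt, se = se_rt), 2))
print(round(bar_limits, 2))
```

For these values the mean is 500 and the standard error is about 14.14, so the error bar runs from roughly 485.86 to 514.14.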

Scatter Plots

Code

ggplot(lex_data, aes(x = Participant, y = RT_ms, color = Condition)) +  
  geom_point(alpha = 0.7, size = 2) +  
  scale_color_manual(values = c("steelblue", "tomato", "seagreen")) +  
  theme_bw() +  
  theme(panel.grid.minor = element_blank()) +  
  labs(title = "Individual RT observations by participant and condition",  
       x = "Participant ID", y = "Reaction time (ms)",  
       color = "Condition")

Saving Plots

Code

# Save the most recently displayed plot  
# (width and height are in inches by default;  
#  the images folder should already exist)  
ggsave(  
  filename = here::here("images", "my_plot.png"),  
  width    = 8,  
  height   = 5,  
  dpi      = 300  
)  
  
# Save a named plot object  
my_plot <- ggplot(lex_data, aes(x = RT_ms)) + geom_histogram(bins = 20)  
  
ggsave(  
  plot     = my_plot,  
  filename = here::here("images", "histogram.pdf"),  
  width    = 6,  
  height   = 4  
)
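
Saving can fail when the target folder does not exist yet. The sketch below creates the folder first and then checks that the file was written. It is an illustration under stated assumptions: tempdir() stands in for the project root (in a real project you would use here::here("images")), the plot uses the built-in mtcars data, and the names out_dir and out_file are made up.

```r
library(ggplot2)

# Stand-in for the project's images folder
out_dir <- file.path(tempdir(), "images")

# Create the folder if it is missing (no warning if it already exists)
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, color = "white") +
  theme_bw()

out_file <- file.path(out_dir, "histogram.pdf")

# width and height are in inches unless the units argument says otherwise
ggsave(filename = out_file, plot = p, width = 6, height = 4)

# Confirm that the file now exists
print(file.exists(out_file))
```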

ggplot2 Quick Tips

Add theme_bw() for a clean white background (LADAL standard)
Add theme(panel.grid.minor = element_blank()) to remove minor gridlines
Use scale_color_manual() / scale_fill_manual() to control colours
Use facet_wrap(~ variable) to create small multiples
Use labs() to set title, subtitle, and axis labels
Use coord_flip() to swap x and y axes (useful for long category names)

Exercises: Visualisation

Q1. In ggplot2, what does aes() control?

Q2. Which geom_*() function would you use to create a histogram?

Getting Help

Section Overview

What you’ll learn: How to find help efficiently when you are stuck — both within R and online

Every R user gets stuck regularly. Knowing where to look for help is as important as knowing R itself.

Help Within R

Code

# Help page for a specific function  
?mean  
help(mean)  
  
# Search for functions related to a keyword  
??regression  
apropos("filter")  
  
# See a function's arguments  
args(ggplot)  
  
# See examples of a function in action  
example(boxplot)

RStudio’s Help tab (bottom right pane) renders help pages with formatted descriptions, argument lists, and examples.

Vignettes

Many packages include vignettes — detailed guides that show how to use the package end-to-end. These are often more useful than the function-level help pages:

Code

# List all vignettes for a package  
vignette(package = "dplyr")  
  
# Open a specific vignette  
vignette("dplyr")  
vignette("ggplot2-specs")

Reading Error Messages

Error messages are your friend: they usually tell you what went wrong, and often where. Common error patterns:

Common Errors and What They Mean

object 'x' not found
→ The object x does not exist in your environment. Did you run the line that creates it? Is it spelled correctly (case-sensitive)?

could not find function "ggplot"
→ The package containing this function is not loaded, or the function name is misspelled. Did you run library(ggplot2)?

cannot open file 'data.csv': No such file or directory
→ R cannot find the file. This text appears as a warning next to the error "cannot open the connection". Check your working directory (getwd()), use here::here(), and check for typos in the filename.

non-numeric argument to binary operator
→ You tried to do arithmetic on a character string. Check the type of your object with class().

NAs introduced by coercion
→ This is a warning, not an error, so your code keeps running. R tried to convert character values to numeric, and the values it could not convert became NA. Inspect the affected column for unexpected text.

object of type 'closure' is not subsettable
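→ You tried to index a function as if it were a vector or data frame (e.g., mean[1]). Check whether you used square brackets where you meant parentheses, or whether the object you meant to index shares its name with a function and has not been created yet.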
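
To get familiar with these messages, it can help to trigger them on purpose. The sketch below does this in base R. The helper message_of() is hypothetical (written for this example, not part of any package): it evaluates an expression and returns the text of the first error or warning instead of stopping the script. The call to Sys.setenv() asks R for English messages and is an assumption about your setup; the exact wording may also differ slightly between R versions.

```r
# Ask R for English messages so the text matches the list above
Sys.setenv(LANGUAGE = "en")

# Hypothetical helper: return the message of the first error or warning
# raised by an expression, or a fixed string if nothing goes wrong
message_of <- function(expr) {
  tryCatch(
    {
      expr
      "no error or warning"
    },
    error   = function(e) conditionMessage(e),
    warning = function(w) conditionMessage(w)
  )
}

msg_missing_object <- message_of(undefined_object_xyz + 1)  # object never created
msg_non_numeric    <- message_of("5" + 1)                   # arithmetic on text
msg_coercion       <- message_of(as.numeric("abc"))         # warning, not error
msg_closure        <- message_of(mean[1])                   # indexing a function

print(c(msg_missing_object, msg_non_numeric, msg_coercion, msg_closure))
```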

Searching Online

The R community is enormous and helpful. When you encounter an error:

Copy the exact error message and paste it into Google with “R” at the start
Stack Overflow (stackoverflow.com) has answers to most common R questions
Posit Community (forum.posit.co, formerly RStudio Community) is welcoming to beginners
CRAN package pages list vignettes, reference manuals, and NEWS files
Package websites (e.g., dplyr.tidyverse.org) have well-structured guides

Writing a Good Question

If you need to ask for help, always provide:

A minimal reproducible example: the smallest piece of code that demonstrates the problem
Your session info: sessionInfo()
The exact error message (copy-paste, do not retype)
What you expected to happen vs. what actually happened

The reprex package helps format reproducible examples. Install it with install.packages("reprex"), copy your code to the clipboard, and run reprex::reprex().
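
A reproducible example needs data that other people can recreate. One common approach, sketched here in base R with a made-up data frame (small_df and its values are hypothetical), is to shrink the data to a few rows and share the output of dput(), which prints R code that rebuilds the object.

```r
# A tiny made-up data set, just large enough to show the problem
small_df <- data.frame(
  Condition = c("High_Freq", "Low_Freq", "Pseudoword"),
  RT_ms     = c(455.2, 512.8, 601.4),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object;
# paste that code into your question so others can run it
dput(small_df)

# Check that the printed code rebuilds an equivalent object
rebuilt <- eval(parse(text = capture.output(dput(small_df))))
print(all.equal(small_df, rebuilt))
```

Helpers can then paste the dput() output into their own session and work with exactly the data you used.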

Key Online Resources

Resource	URL	Why useful
R for Data Science	r4ds.hadley.nz	Free online book; the best comprehensive introduction to R and the tidyverse
RStudio Cheatsheets	posit.co/resources/cheatsheets	One-page quick references for popular packages (dplyr, ggplot2, RMarkdown, etc.)
CRAN Task Views	cran.r-project.org/web/views	Curated lists of R packages by topic (natural language processing, spatial data, etc.)
Stack Overflow [r]	stackoverflow.com/questions/tagged/r	Answers to nearly every R question; search before posting
Tidyverse documentation	tidyverse.org	Official documentation for dplyr, ggplot2, tidyr, readr, and more
ggplot2 documentation	ggplot2.tidyverse.org	Function reference, articles, and extension gallery
R Graph Gallery	r-graph-gallery.com	Hundreds of example plots with full reproducible code

Best Practices

Section Overview

What you’ll learn: Habits and conventions that make your R code more readable, reproducible, and robust

Good coding habits matter more the longer your projects become. These practices are worth building from day one.

Code Style

Comment your code liberally: # This filters to English speakers only
Use consistent naming: word_count not WordCount or wc
Keep lines under 80 characters (use line breaks inside functions)
Add spaces around operators: x <- 5 * (3 + 2) not x<-5*(3+2)
Load all packages at the top of the script
Set the random seed at the top when using random processes: set.seed(42)
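
The effect of set.seed() is easy to check for yourself. The short sketch below follows the style points above (comments, snake_case names, spaces around operators) and shows that resetting the same seed reproduces the same random numbers. It uses base R only, and the object names are made up for this example.

```r
# Load packages at the top of the script (this example needs none)
# library(dplyr)

# Fix the seed so the random numbers are the same on every run
set.seed(42)
first_draw <- rnorm(3)

# Resetting the same seed reproduces the same numbers
set.seed(42)
second_draw <- rnorm(3)

# Compare the two draws
draws_match <- identical(first_draw, second_draw)
print(draws_match)
```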

Project Structure

Always work inside an R Project (.Rproj)
Use here::here() for all file paths — never hardcode absolute paths like "C:/Users/Martin/..."
Keep raw data read-only — never overwrite original files; save processed versions separately
Use version control (Git) for anything important
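
The sketch below illustrates the raw versus processed data convention. It is an illustration only: tempdir() stands in for the project root, the folder names data/raw and data/processed are one possible layout rather than a LADAL requirement, and the file names and values are made up. Inside an R Project you would build the same paths with here::here().

```r
# Stand-in for the project root; in an R Project use here::here() instead
project_root  <- file.path(tempdir(), "my_project")
raw_dir       <- file.path(project_root, "data", "raw")
processed_dir <- file.path(project_root, "data", "processed")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Pretend this file is the original data from the experiment
raw_file <- file.path(raw_dir, "rt_raw.csv")
write.csv(data.frame(id = 1:3, rt = c(510, NA, 480)),
          raw_file, row.names = FALSE)

# Read the raw file, clean it, and save the result under a new name
raw_data   <- read.csv(raw_file)
clean_data <- raw_data[!is.na(raw_data$rt), ]
clean_file <- file.path(processed_dir, "rt_clean.csv")
write.csv(clean_data, clean_file, row.names = FALSE)

# The raw file is untouched; the cleaned copy has one row fewer
n_raw   <- nrow(read.csv(raw_file))
n_clean <- nrow(read.csv(clean_file))
print(c(raw = n_raw, clean = n_clean))
```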

Reproducibility

Write all analyses in R Notebooks or scripts — never rely on Console-only work
Render your notebook from scratch periodically to confirm it runs end-to-end
End every notebook with sessionInfo() to record package versions
Consider using renv to snapshot your package environment
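
Besides printing sessionInfo() at the end of a notebook, you can store its output in a text file next to your results. A minimal sketch in base R, in which tempdir() stands in for a project folder and the file name session_info.txt is made up:

```r
# Capture the printed session information as a character vector
session_lines <- capture.output(sessionInfo())

# Write it to a text file; tempdir() is a stand-in for a project folder
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, info_file)

# The first line records the R version used for the analysis
print(session_lines[1])
```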

Environment Hygiene

Code

# See all objects in your environment  
ls()  
  
# Remove a specific object  
rm(my_temp_variable)  
  
# Remove everything (use with caution!)  
# Note: this does not unload packages or reset options  
rm(list = ls())  
  
# Check working directory  
getwd()  
  
# Change working directory (prefer R Projects over setwd())  
setwd("path/to/folder")  # avoid this; use R Projects instead

Citation & Session Info

Schweinberger, Martin. 2026. Getting Started with R and RStudio. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/intror/intror.html (Version 2026.02.19).

@manual{schweinberger2026intror,  
  author       = {Schweinberger, Martin},  
  title        = {Getting Started with R and RStudio},  
  note         = {https://ladal.edu.au/tutorials/intror/intror.html},  
  year         = {2026},  
  organization = {The University of Queensland, Australia. School of Languages and Cultures},  
  address      = {Brisbane},  
  edition      = {2026.02.19}  
}

Code

sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] checkdown_0.0.13 flextable_0.9.7  here_1.0.1       tokenizers_0.3.0
 [5] tm_0.7-16        NLP_0.3-2        readxl_1.4.3     quanteda_4.2.0  
 [9] tidytext_0.4.2   lubridate_1.9.4  forcats_1.0.0    stringr_1.5.1   
[13] dplyr_1.2.0      purrr_1.0.4      readr_2.1.5      tidyr_1.3.2     
[17] tibble_3.2.1     ggplot2_4.0.2    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] fastmatch_1.1-6         gtable_0.3.6            xfun_0.56              
 [4] htmlwidgets_1.6.4       lattice_0.22-6          tzdb_0.4.0             
 [7] vctrs_0.7.1             tools_4.4.2             generics_0.1.3         
[10] parallel_4.4.2          janeaustenr_1.0.0       pkgconfig_2.0.3        
[13] Matrix_1.7-2            data.table_1.17.0       RColorBrewer_1.1-3     
[16] S7_0.2.1                uuid_1.2-1              lifecycle_1.0.5        
[19] compiler_4.4.2          farver_2.1.2            textshaping_1.0.0      
[22] codetools_0.2-20        litedown_0.9            fontLiberation_0.1.0   
[25] fontquiver_0.2.1        SnowballC_0.7.1         htmltools_0.5.9        
[28] yaml_2.3.10             pillar_1.10.1           openssl_2.3.2          
[31] fontBitstreamVera_0.1.1 commonmark_2.0.0        stopwords_2.3          
[34] zip_2.3.2               tidyselect_1.2.1        digest_0.6.39          
[37] stringi_1.8.4           slam_0.1-55             labeling_0.4.3         
[40] rprojroot_2.0.4         fastmap_1.2.0           grid_4.4.2             
[43] cli_3.6.4               magrittr_2.0.3          withr_3.0.2            
[46] gdtools_0.4.1           scales_1.4.0            timechange_0.3.0       
[49] officer_0.6.7           rmarkdown_2.30          cellranger_1.1.0       
[52] ragg_1.3.3              askpass_1.2.1           hms_1.1.3              
[55] evaluate_1.0.3          knitr_1.51              markdown_2.0           
[58] rlang_1.1.7             Rcpp_1.0.14             glue_1.8.0             
[61] xml2_1.3.6              renv_1.1.1              rstudioapi_0.17.1      
[64] jsonlite_1.9.0          R6_2.6.1                systemfonts_1.2.1
